A distribution information sharing federated learning approach for medical image data

In recent years, federated learning has been expected to play a considerable role in cross-silo scenarios (e.g., medical institutions) due to its privacy-preserving properties. However, non-IID data are common in federated learning between medical institutions, which degrades the performance of traditional federated learning algorithms. To overcome this performance degradation, a novel distribution information sharing federated learning approach (FedDIS) for medical image classification is proposed. It reduces non-IIDness across clients by generating data locally at each client from the medical image data distributions shared by the other clients, while protecting patient privacy. First, a variational autoencoder (VAE) is trained in a federated manner; its encoder is used to map the local original medical images into a hidden space, and the distribution information of the mapped data in the hidden space is estimated and then shared among the clients. Second, the clients generate a new set of image data from the received distribution information with the decoder of the VAE. Finally, the clients use the local dataset along with the augmented dataset to train the final classification model in a federated learning manner. Experiments on the diagnosis task of an Alzheimer’s disease MRI dataset and the MNIST data classification task show that the proposed method can significantly improve the performance of federated learning in non-IID cases.


Introduction
In traditional medicine, imaging doctors need to perform many diagnostic classifications of images, which consumes considerable time and energy and may cause misjudgment due to environmental and other factors. In recent years, with the development of artificial intelligence related technologies, medical imaging disease diagnosis classification based on deep learning has gradually become one of the most promising areas for artificial intelligence due to its excellent performance.
In reality, deep learning requires large amounts of training data to achieve good performance. However, the quantity of imaging data available in a single hospital is small. Since the data contain private patient information, it is not possible to collect and organize data from multiple hospitals, which makes centralized machine learning impossible. Recently, federated learning (FL) has emerged as a new paradigm for distributed machine learning [1], which can jointly train a deep learning model with multiple data owners without the data leaving the local site. Depending on the application scenario, federated learning can be broadly categorized into cross-device FL and cross-silo FL; the clients in the cross-silo setting are a small number of organizations (e.g., medical institutions) with reliable communications and abundant computing resources in datacenters [2]. We focus on cross-silo federated learning in this paper. Federated learning has been studied in medical applications such as health care [3,4], automatic classification of medical images [5], MRI reconstruction [6], and COVID-19 detection [7,8].
Jianjun Huang (huangjin@szu.edu.cn), Leiyang Zhao (zly20212021htab@163.com). Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
However, non-IID data present a serious challenge to federated learning. In the non-IID case, the basic assumption of independent and identically distributed data in federated learning is no longer satisfied, and the optimization direction of the local model may be far from the global optimization direction, which inherently induces a local optimum [9]. Some algorithm-level federated learning approaches to the non-IID problem have recently emerged [9,10]. However, Li et al. [11] show that these methods perform as poorly as FedAvg [12] on image datasets with deep learning models. Recently, some data-level federated learning approaches have emerged as new directions for solving non-IID problems [13,14], but they can cause problems such as privacy leakage.
To better address the problem of degraded federated learning performance under non-IID image data, FedDIS is proposed. First, each client learns the distribution of all client datasets; it then augments its local dataset according to these distributions to approach IID; finally, the task model is trained on the augmented dataset to solve the non-IID problem. The main contributions of this study include the following:
(1) We propose a distribution information sharing strategy that preserves patient privacy while improving the performance of federated learning on non-IID datasets. In this method, each client shares its individual local data distribution in the hidden space so that other clients can construct an IID dataset to solve the non-IID problem.
(2) We use σ-point sampling to generate more hidden space data points, which enables more reliable distribution parameter estimation and better privacy protection.
(3) Metrics such as the distance and the maximum structural similarity between sets are introduced to measure the privacy-preserving ability of the proposed method.
(4) We employ the difference in training performance between the generated data and the original data to evaluate the effectiveness of the hidden space distribution information sharing strategy.
(5) We conducted experiments to investigate the algorithm itself, including exploring the effects of different hidden space distribution types and σ-point parameters on the performance of the method.
(6) We conducted comparative experiments to validate the superiority of the algorithm. The experiments consisted of two parts, which verified the advantages of the method under local data imbalance on the client side and under inconsistent data distribution among the clients, respectively. The comparison results show that our method has better performance and faster convergence.
The remainder of this paper is organized as follows. "Related work" presents some related work. In "Methodology", the overall process of the method is introduced, as well as the core distribution information sharing and data generation methods. "Experiments and results" gives the details of the experimental design and results and provides a comparative analysis of the experimental results. "Conclusion and future work" concludes the paper.

Related work
In recent years, FL has emerged as a new paradigm for distributed machine learning; because of its secure and collaborative features, it has been widely used in medical and financial institutions. The degradation of federated learning performance caused by non-IID data is one of the main challenges for federated learning [15], and there are two types of solutions: data level and algorithm level. At the data level, there are data-based methods and data-free methods. In the data-based approach, a globally shared small IID dataset can reduce the impact of non-IID data [16], but in fields such as medicine, data sharing is difficult to implement in practical applications. Sun et al. [17] introduced a data redundancy strategy that deals with non-IID data by exchanging local data with trusted nodes to improve classification accuracy under non-IID conditions. Knowledge distillation (KD) is a technique that transfers knowledge from one or more teacher models to an untrained student model. FEDDFUSION [13] is a data-based KD approach that uses generated data to aggregate heterogeneous knowledge from all received client models. FEDGEN [14] is a data-free KD approach: the server learns a generator in a data-free manner and then broadcasts it to the clients to adjust the training of the local models. At the algorithm level, FedAvg [12] was the first federated learning algorithm to emerge, and it deals with non-IID data to a certain degree. Wang et al. [18] used the relationship between the gradient magnitude and sample quantity to estimate the data class imbalance, and designed a new loss function, ratio loss, to increase the effect of minority class data on the results. Sarkar et al. [19] improved the performance under class imbalance in federated learning by reshaping the cross-entropy loss, reducing the weight of majority class data on the model when the model has high prediction accuracy for the majority class and increasing the weight of the minority class accordingly. FedProx [10] uses the global model to correct the local training direction; however, related studies have concluded that FedProx has little advantage over FedAvg. Federated learning methods for non-IID data are becoming increasingly abundant; however, the cross-silo scenario is not well studied. Data-based approaches are limited in their application due to the risk of privacy breaches. The data-free approach is a new way of solving the non-IID data problem in federated learning due to its privacy-preserving ability and good performance, but the performance degradation on non-IID image data under federated learning has not been well studied for data-free methods.

Problem description
A large quantity of medical data is needed to train a disease classification model; however, medical data are scattered across different medical institutions and cannot be collected centrally due to privacy issues. Let $X_1, \ldots, X_m$ denote the local image datasets of $m$ medical institutions, with corresponding data distributions $P_1, \ldots, P_m$; thus, $X = X_1 \cup \cdots \cup X_m$ is the global dataset, and $P_X$ is the global data distribution. Assume $X_i \cap X_j = \emptyset$ for $i \neq j$, and let $|X_i|$ denote the cardinality of $X_i$. Assume a task model $f(w, x, y)$ (e.g., a medical image classification model) is to be trained, and its corresponding model parameter is $w \in \mathbb{R}^d$. Let $(x_j, y_j)$ denote a data sample; here, $x_j$ represents the input (image) to the task model, and $y_j$ represents its corresponding output (class label). Then, the global loss function $F(w)$ can be written as
$$F(w) = \sum_{i=1}^{m} \frac{|X_i|}{|X|} F_i(w) \tag{1}$$
where $F_i(w)$ is the local loss function of client $i$, which is defined as
$$F_i(w) = \frac{1}{|X_i|} \sum_{(x_j, y_j) \in X_i} f(w, x_j, y_j) \tag{2}$$
Training the task model under IID conditions can obtain good performance with FedAvg, and IID data in federated learning satisfy
$$P_i(x, y) = P_X(x, y) \tag{3}$$
where $P_i(x, y)$ is the data distribution of the $i$-th client for $i = 1, \ldots, m$ and $P_X(x, y)$ is the global distribution. However, FedAvg suffers from performance degradation in the non-IID case, i.e., $P_i(x, y) \neq P_j(x, y)$. A natural idea is to construct an augmented dataset $D_i$ for each client so that its distribution becomes identical to those of the other clients, such that
$$P_{X_i \cup D_i}(x, y) = P_{X_j \cup D_j}(x, y), \quad \forall\, i \neq j \tag{4}$$
where $D_i = \{ z \mid z \sim \bigcup_{j \neq i} P_j(x, y) \}$ can be obtained by generative neural network learning such as a VAE.
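As a small numerical illustration of the objective described above, the following Python sketch assumes the usual FedAvg form, in which each local loss is an average over the client's samples and the global loss weights each client by its share of the total data. The function names and the toy per-sample losses are hypothetical and only serve the example.

```python
def local_loss(per_sample_losses):
    """Average loss F_i(w) over the samples of one client."""
    return sum(per_sample_losses) / len(per_sample_losses)


def global_loss(clients):
    """Size-weighted sum of local losses: sum_i |X_i| / |X| * F_i(w)."""
    total = sum(len(c) for c in clients)
    return sum(len(c) / total * local_loss(c) for c in clients)


# Toy per-sample losses for three clients of different sizes (made-up numbers).
clients = [[0.2, 0.4], [1.0], [0.5, 0.5, 0.5]]
print(global_loss(clients))
```

With this weighting, the global loss coincides with the plain average over all samples of all clients, which is why equally sized IID clients behave like centralized training in expectation.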

Method
Data sharing and data redundancy strategies can improve the performance of federated learning under non-IID conditions [16,17]. The data sharing strategy distributes a subset $X_G$ that obeys the global distribution $P_X$ to each client, so that the data of the clients become $X_1 \cup X_G, \ldots, X_m \cup X_G$. The data redundancy strategy stores the data of each client on $K$ different workers that execute federated learning; for example, when there are two clients in total and $K$ is 2, the data stored by the two workers are $X_1 \cup X_2$ and $X_2 \cup X_1$. The sharing or exchange of raw data in these methods is not applicable in most real scenarios because it may leak private data. Local data generation, such as the SMOTE method, does not leak private data: it augments the local dataset $X_i$ with a synthetic sample set $X_i^S$, obtained by sampling among the nearest neighbors of minority class samples, to obtain a new local dataset $X_i \cup X_i^S$ for model training. However, this method is not effective because it cannot utilize the data distributions of other clients and therefore does not change the degree of non-IID of each client's data. We improve the performance of federated learning under non-IID data through indirect data sharing, constructing a dataset locally on each client with the same distribution as the global data. The process of FedDIS is divided into two stages: the first is shown in Algorithm 1, and the second executes federated learning on each client using the FedAvg algorithm.
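To make the local-generation baseline concrete, here is a minimal Python sketch of the SMOTE idea: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name, the neighbour count `k` and the toy data are assumptions for illustration, not the setup used in the experiments.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=5)
```

Because every synthetic point is built from local samples only, the class balance on one client improves, but nothing about the other clients' distributions enters the local dataset.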
Algorithm 1 The first stage of FedDIS
1: Require: Number of clients $m$
2: for client $i = 1, 2, \ldots, m$ do
3:   Learn and obtain the information of the data distribution $DI_i$ in the hidden space from the dataset $X_i$ of client $i$; the distribution represented by $DI_i$ is denoted as $P_i(x, y)$
4:   Upload $DI_i$ to the server
5: end for
6: The server broadcasts the distribution information set $\{DI_1, DI_2, \ldots, DI_m\}$ to each client
7: for client $i = 1, 2, \ldots, m$ do
8:   The new local dataset $X_i = X_i \cup X_i^a$ is obtained by augmenting the local dataset according to $\{DI_1, DI_2, \ldots, DI_m\}$; here $X_i^a \sim \bigcup_{j \neq i} P_j(x, y)$
9: end for
VAE has a good ability to learn data distributions [20]. To learn the distribution of high-dimensional medical data, we use the encoder of the VAE to reduce the dimension of the original data and estimate the distribution of the encoded data in the hidden space. This does not disclose the private content of the original data, and it reduces the communication cost of federated learning because the distribution information is much smaller than the data. Let the VAE encoder and decoder be En(·) and De(·), respectively. The simple VAE structure is shown in Fig. 1.
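The control flow of the first stage (clients upload distribution information, the server broadcasts it, and each client augments its data from the other clients' information) can be sketched in Python as follows. This is only a schematic under stated assumptions: `estimate_di` and `generate` are hypothetical callables standing in for the encoder-based estimation and decoder-based generation, and the toy stand-ins below summarise a client by its mean and sample count.

```python
def stage_one(client_data, estimate_di, generate):
    """Return the augmented local dataset of every client.

    estimate_di(X_i) -> distribution information DI_i    (hypothetical helper)
    generate(DI_j)   -> synthetic samples drawn from DI_j (hypothetical helper)
    """
    # Steps 2-5: every client summarises its own data and uploads DI_i.
    uploaded = [estimate_di(x) for x in client_data]
    # Step 6: the server broadcasts the full set {DI_1, ..., DI_m}.
    broadcast = list(uploaded)
    # Steps 7-9: client i augments with samples from every other client's DI.
    augmented = []
    for i, x in enumerate(client_data):
        extra = [s for j, di in enumerate(broadcast) if j != i for s in generate(di)]
        augmented.append(list(x) + extra)
    return augmented


# Toy stand-ins: DI is (mean, count); "generation" repeats the mean.
data = [[1.0, 3.0], [10.0], [5.0, 5.0, 5.0]]
out = stage_one(
    data,
    estimate_di=lambda x: (sum(x) / len(x), len(x)),
    generate=lambda di: [di[0]] * di[1],
)
```

In this toy run every client ends up with the same number of samples, since each one adds as many synthetic samples as the other clients hold real ones.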
Fig. 1 Simple structure of VAE
In the implementation, the VAE is trained first, and then En(·) is used to map the local data to the hidden space. To share the distribution information (DI) of the data, the local data are assumed to obey some probability distribution $p(z|\theta)$ with parameters $\theta$ in the hidden space, and $\theta$ is estimated from the hidden space data. DI contains the distribution type and the parameters $\theta$ of $p(z|\theta)$, and it is shared with the other clients. Finally, each client augments its local data using the received distribution parameters and the decoder.
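A minimal Python sketch of the parameter estimation step is given below. It assumes that the hidden space dimensions are treated independently and covers two of the distribution types considered later (normal and uniform); the dictionary layout of the distribution information and the function name are illustrative assumptions rather than the authors' data format.

```python
import statistics


def estimate_di(latents, dist_type):
    """Estimate per-dimension parameters theta of p(z | theta)."""
    dims = list(zip(*latents))  # one tuple of values per latent dimension
    if dist_type == "normal":
        theta = [(statistics.fmean(d), statistics.pstdev(d)) for d in dims]
    elif dist_type == "uniform":
        theta = [(min(d), max(d)) for d in dims]
    else:
        raise ValueError(f"unsupported distribution type: {dist_type}")
    # The sample count is kept so that receivers know how much data to generate.
    return {"type": dist_type, "theta": theta, "n": len(latents)}


latents = [[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]]
di_uniform = estimate_di(latents, "uniform")
di_normal = estimate_di(latents, "normal")
```

Only the distribution type, a handful of parameters and a count leave the client, which is what keeps both the privacy exposure and the communication cost small.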

Data augmentation based on sharing of hidden space distribution
Our core approach is to obtain information about the hidden space distribution of an individual client's local dataset. A client encodes the local data into the hidden space, estimates the distribution parameters based on the reparameterized data and the distribution type, and shares them with other clients. Assume the distribution of the hidden space of a client is $p(z|\theta)$; here, $z$ is the $d$-dimensional hidden space random vector, and $\theta$ is the parameter of the distribution. To estimate $\theta$, a set of reparameterized samples is generated by using the local images of the client and the VAE encoder. Let the original image dataset of the client be $X = \{x_i \mid i = 1, 2, \ldots, k\}$; for each image $x_i$ in $X$, the VAE encoder gives
$$(m_i, \Sigma_i) = \mathrm{En}(x_i) \tag{5}$$
where $m_i$ is a $d$-dimensional vector and $\Sigma_i$ is a $d \times d$ matrix.
Since the resampling of the VAE first generates $u \sim N(0, I)$ and then takes $z = m_i + \Sigma_i u$ as a sample of the hidden space, it is likely to produce outliers, and only one hidden space sample is generated for each data sample; such a small number of samples cannot be used to estimate the distribution parameters well. To obtain a more accurate estimation of the hidden space distribution parameters, reparameterization in the hidden space is conducted by using the σ-point sampling method [21] to generate a set of $2d$ σ-points as
$$z_{i,n} = m_i + \alpha \left(\sqrt{\Sigma_i}\right)_n, \qquad z_{i,d+n} = m_i - \alpha \left(\sqrt{\Sigma_i}\right)_n, \qquad n = 1, \ldots, d \tag{6}$$
where $\left(\sqrt{\Sigma_i}\right)_n$ is the $n$-th column of the square root of $\Sigma_i$ and $\alpha > 0$ is a constant that defines the exact placement of the sigma points. The set of reparameterized samples in the hidden space will be
$$Z = \{ z_{i,n} \mid i = 1, 2, \ldots, k;\ n = 1, \ldots, 2d \} \tag{7}$$
which is used to estimate $\theta$ of the selected distribution model $p(z|\theta)$. Since σ-point sampling increases the quantity of hidden space data, the distribution of the hidden space can be better estimated, while the risk of privacy leakage of the original data samples is reduced because only at the encoded mean vector can the decoder recover data that are similar enough to the original data. When the distribution parameters are estimated, client $i$ sends the distribution type, the distribution parameters, and the number of local data elements $|X_i|$ to the server, which broadcasts them to the other clients. Once a client $k$ receives the type of hidden space distribution and the corresponding parameters of the other clients from the central server, a set of hidden space samples $Z = \{ z_j \mid j = 1, 2, \ldots, \sum_{i \neq k} |X_i| \}$ can be generated to augment the training data:
$$X^a = \{ \mathrm{De}(z_j) \mid z_j \in Z \}$$
This augmented dataset $X^a$, which contains the corresponding amount of synthetic data according to the $|X_i|$ of each other client, is then used together with the original dataset to train the client's local task model.
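The σ-point construction can be illustrated with a short Python sketch. It assumes a diagonal matrix, so that the n-th column of its square root has a single non-zero entry, and it places the 2d points symmetrically around the encoded mean at a distance scaled by α. The function name and the example values are hypothetical.

```python
import math


def sigma_points(mean, diag_cov, alpha=1.0):
    """Return the 2d sigma points m +/- alpha * (sqrt(Sigma))_n for diagonal Sigma."""
    d = len(mean)
    points = []
    for sign in (1.0, -1.0):
        for n in range(d):
            point = list(mean)
            # n-th column of sqrt(Sigma) has one non-zero entry when Sigma is diagonal
            point[n] += sign * alpha * math.sqrt(diag_cov[n])
            points.append(point)
    return points


pts = sigma_points(mean=[1.0, -1.0], diag_cov=[4.0, 9.0], alpha=1.0)
# pts == [[3.0, -1.0], [1.0, 2.0], [-1.0, -1.0], [1.0, -4.0]]
```

Note that none of the 2d points coincides with the encoded mean itself, which is consistent with the privacy argument that the decoder reproduces the original image most closely at the mean vector.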
Fig. 2 The boxplot of SSIM shows that the mean value is only approximately 0.65 and the maximum value is less than 0.75; thus, the privacy of the original data can be well protected
To explore the privacy preservation of the proposed method, the distance $d(A, B)$ between the generated dataset $A$ and the original dataset $B$ is used. It is built on the distance $d(x, B)$ from a point $x$ to the set $B$, which in turn is built on the distance between two points $x$ and $y$, defined as $d(x, y) = \|x - y\|_2$. Table 1 shows the distance between the generated dataset and the original dataset under the uniform distribution type. The table shows that as the dimensionality of the data increases, the distance increases, and the risk of privacy leakage decreases. Therefore, for high-dimensional data, the possibility of privacy exposure through the data generated with the VAE is small.
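One possible way to compute such a set-to-set distance is sketched below in Python. Only the Euclidean point distance is taken from the text; the nearest-neighbour form of the point-to-set distance and the averaging over the generated set are assumptions made for this example and may differ from the definition behind Table 1.

```python
import math


def point_distance(x, y):
    """Euclidean distance d(x, y) = ||x - y||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def point_to_set(x, B):
    """Assumed form of d(x, B): distance to the closest element of B."""
    return min(point_distance(x, y) for y in B)


def set_distance(A, B):
    """Assumed form of d(A, B): mean point-to-set distance over A."""
    return sum(point_to_set(x, B) for x in A) / len(A)


generated = [[0.0, 0.0], [3.0, 4.0]]
original = [[0.0, 1.0], [3.0, 0.0]]
print(set_distance(generated, original))
```

Under these assumptions, a larger value means that generated samples lie farther from their closest original samples, which is the sense in which a growing distance indicates a lower leakage risk.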
For privacy protection on real image datasets, structural similarity (SSIM) is used to show that the generated data samples do not reveal individual private information in the original data samples. The maximum SSIM between the generated image set $A$ and the original image set $B$ is employed as a privacy-preserving metric, which is mathematically represented as
$$\max_{x \in A,\ y \in B} \mathrm{SSIM}(x, y)$$
Fig. 2 shows the SSIM in an MRI dataset with 1000 generated samples. It can be seen that the SSIM between the generated data samples and the original data samples stays below 0.75, so no generated sample closely reproduces an original image and the original data remain well protected. To show the difference between the data generated from the distribution information and the original data more intuitively, the image pair with the largest SSIM is shown in Fig. 3.
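The metric can be illustrated with the following Python sketch, which scores one generated image against every original image and keeps the largest value. For brevity it uses a simplified SSIM computed from whole-image statistics instead of the usual sliding-window average, with the standard stabilising constants (0.01·L)² and (0.03·L)²; images are flat lists of pixel values in [0, 1], and all names and values are illustrative.

```python
def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (no sliding window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def max_ssim(generated_image, originals):
    """Largest SSIM between one generated image and any original image."""
    return max(global_ssim(generated_image, o) for o in originals)


originals = [[0.0, 0.5, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
generated = [0.1, 0.5, 0.9, 0.5]
score = max_ssim(generated, originals)
```

Collecting this per-image maximum over all generated samples gives the kind of distribution a boxplot can summarise; a value close to 1 would flag a generated image that nearly duplicates a real one.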
The method uses only the distribution of the dataset and does not involve individual data samples, so augmented datasets cannot disclose patient privacy, but they have the same distribution as the original dataset and, therefore, have similar effects on model training. We test this with a classification task on an MRI dataset. The difference in training performance between the generated data and the original data (Dpgo), measured after the final convergence of the models trained individually on the original and the generated data, is shown in Table 2. In the definition of Dpgo, $X$ is the distribution of the original dataset, $X_G$ is the distribution of the generated dataset, $x$ is a real training sample, $y$ is the corresponding label, $w$ denotes the parameters of the task model, and $l(\cdot)$ is the cross-entropy loss function. From the above results, we can see that the VAE-generated data have similar training effects to the original data.

Experiments and results
In this section, we conduct two types of experiments. The first type studies the algorithm itself, exploring the effects of different hidden space distribution types and σ-point parameters on the performance of the method, with the purpose of discovering which factors affect its performance. The second type is the comparison experiment, which consists of two parts: one is conducted under local data imbalance on the client, and the other under inconsistent data distribution among clients. Its purpose is to verify the superiority of the method.

Experimental setup
Dataset: The datasets used in the experiments are the Alzheimer's disease MRI dataset and the MNIST dataset; the former is used for disease diagnosis, and the latter is used for digit image classification. The original MRI dataset downloaded from Kaggle was divided into two parts, the training set and the test set, each containing four classes of data, namely, nondementia, very mild dementia, mild dementia, and moderate dementia; the three dementia classes were merged into one category. The merged training set contains 2,561 images with dementia and 2,560 images without dementia, and the test set contains 639 images with dementia and 640 images without dementia. The original size of the images was 1×176×208; they were resized to 1×176×176 during training and normalized. Sample images for demented and normal individuals are shown in Fig. 4.

Non-IID settings:
The MRI dataset was randomly partitioned according to Table 3 and assigned to different clients to emulate federated learning under non-IID conditions between hospitals in a real scenario, and the earth mover's distance (EMD) is used to measure the degree of non-IID. Let the data distributions of the three clients be denoted as p1, p2, and p3. Table 4 shows the pairwise EMD between the data of the three clients. To better compare the results after augmenting the non-IID dataset with those trained directly on the IID training set, we distribute the data equally to the three clients as the IID case. In the first division, the three clients hold 20%, 20%, and 60% of the total disease-free training data and 60%, 20%, and 20% of the total diseased training data, respectively. In the second division, the three clients hold 10%, 10%, and 80% of the total disease-free training data and 80%, 10%, and 10% of the total diseased training data, respectively. Finally, the test set split from the original dataset is used for performance evaluation, and the same test set is used on all clients to ensure fairness. The MNIST dataset was partitioned according to the Dirichlet distribution [22]; the concentration parameter β was set to 0.05, 0.1, and 1, respectively, where a smaller β indicates higher data heterogeneity. The number of clients was set to 20 with an active-user ratio r = 50%.
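To give a feeling for how the two divisions translate into label skew, the Python sketch below builds the label distribution of each client for the first division and compares clients pairwise. It assumes that the EMD between two clients is computed as the L1 distance between their label distributions, which is a common convention in the non-IID federated learning literature; the values it prints are therefore illustrative and are not a reproduction of Table 4.

```python
def label_distribution(counts):
    """Normalise per-class sample counts into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]


def emd(p, q):
    """Assumed EMD: L1 distance between two label distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


N_HEALTHY, N_DEMENTIA = 2560, 2561  # training-set sizes quoted in the text
# First division: shares of the disease-free and diseased data per client.
healthy_share = [0.2, 0.2, 0.6]
dementia_share = [0.6, 0.2, 0.2]

clients = [
    label_distribution([h * N_HEALTHY, d * N_DEMENTIA])
    for h, d in zip(healthy_share, dementia_share)
]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"EMD(p{i + 1}, p{j + 1}) = {emd(clients[i], clients[j]):.3f}")
```

Replacing the two share lists with 10%/10%/80% and 80%/10%/10% gives the corresponding comparison for the second, more skewed division.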
VAE model: VAE is widely used in image processing [23–26]. We designed a specific VAE model for the Alzheimer's disease dataset, with the structure shown in Fig. 5. The main structure of the model is divided into two parts, the encoder and the decoder; the encoder maps the image data into mean and variance vectors and uses the reparameterization trick [20] to obtain the hidden variables. The reparameterization trick allows the model to compute gradients and backpropagate the error from the decoder to the encoder, thus making the whole model trainable. To speed up the training and reduce the communication cost of federated learning, BatchNorm is used between all convolution and transposed convolution layers, and we use LeakyReLU as the activation function. For MNIST, we adjusted the parameters of the VAE model to fit the format of the MNIST data.
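The reparameterization trick mentioned above can be written in a few lines of Python. The sketch assumes the common convention in which the encoder outputs a mean vector and a log-variance vector (the text only speaks of mean and variance vectors), and it works on plain lists so that it runs without a deep learning framework; the function name is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, eps=None, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    # sigma = exp(0.5 * log_var); the randomness lives only in eps, so
    # gradients can flow through mu and log_var in an autodiff framework.
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


z = reparameterize(mu=[1.0, -2.0], log_var=[0.0, 0.0], eps=[0.5, 2.0])
# z == [1.5, 0.0]
```

Passing `eps` explicitly, as in the usage line, makes the draw deterministic, which is convenient for checking the arithmetic; in training the noise would be sampled afresh for every input.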
Task model: The popular neural network VGG16 was used to perform the diagnostic task on the MRI dataset. To train the network faster and minimize the communication cost of federated learning, we use transfer learning with the pretrained model that comes with the PyTorch framework. It is a classification network with 1000 categories trained on ImageNet with more than 1.2 million images. Although these images are from the natural scene domain, images generally share some low-level features, such as lines, textures, and edges. Since our dataset consists of single-channel images and the task is binary classification, we change the first convolutional layer to take a single input channel and the last fully connected layer to output 2 classes. A CNN is used for the MNIST classification task.

Exploratory experiments on FedDIS
This experiment was performed on the MRI dataset. Three distribution types, namely ordinary normal, truncated normal, and uniform, were used to estimate the distribution information in the hidden space, and data augmentation was performed on the clients. After the data generation was completed, the final diagnostic classification experiment was performed. To explore the effect of different hidden space distribution types on the experimental results, Fig. 6 shows the test accuracy over communication rounds using different distribution types in different cases when the σ-point parameter α is fixed to 1. It can be clearly seen that, across the different EMD cases (i.e., different degrees of non-IID), the test performance is better when the distribution type is uniform, which indicates that the uniform distribution type is more suitable for the current dataset.
Fig. 6 Test accuracy over communication rounds using different distribution types
To better show the effect of different hidden space distribution types and σ-point parameters on the final test performance, Fig. 7 shows the boxplots of the test accuracy across 5 runs using different distribution types in different cases with different σ-point parameters. As we can see, different hidden space distribution types and different σ-point parameters cause differences in the final results, and the performance is better using the uniform distribution type and α = 2.

Comparison experiments
Fig. 7 Boxplots of test accuracy in different cases. N, TN, and U represent ordinary normal distribution, truncated normal distribution and uniform distribution types, respectively
This experiment consisted of two comparisons. The first was performed on the MRI dataset with client-side local data imbalance. FedDIS is compared with the FedAvg algorithm [12] and with the data sharing (DS) [16] and data redundancy (DR) [17] approaches; in addition, the synthetic minority oversampling technique (SMOTE) was used to generate minority class data locally based on the distribution of local data as a further comparison. SMOTE has been applied to the medical data imbalance problem [27]; we apply SMOTE on each client with non-IID data, then run federated learning and test its performance for the three clients. We call this method Fed_SMOTE. For a fair comparison, experiments were conducted on the same dataset and in exactly the same environment, with the following method setup.
- Data sharing: a globally shared dataset with a uniform class distribution is created, with β set to 5% and 10%, respectively, where β = (||G||/||D||) × 100%, ||G|| is the size of the global shared dataset G, and ||D|| is the size of the total dataset D. A random α portion of G is distributed to each client, and α is set to 100%, i.e., the global shared dataset is fully allocated to each client.
- Data redundancy: data redundancy is introduced in the system with r = 2, meaning that each copy of the data is stored on 2 different clients.
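The data sharing baseline described in the setup can be sketched in Python as follows: a class-balanced shared set G of size β·|D| is built, and a random α fraction of it is appended to every client's local data. Here β and α are passed as fractions rather than percentages, the shared set is drawn from a toy dataset for simplicity, and both function names are hypothetical.

```python
import random


def build_shared_set(dataset, beta, seed=0):
    """Pick a class-balanced shared subset G with |G| ~= beta * |D|."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in dataset:
        by_class.setdefault(label, []).append((sample, label))
    per_class = int(beta * len(dataset)) // len(by_class)
    shared = []
    for items in by_class.values():
        shared.extend(rng.sample(items, per_class))
    return shared


def distribute(shared, clients, alpha, seed=0):
    """Give every client a random alpha fraction of G on top of its own data."""
    rng = random.Random(seed)
    k = round(alpha * len(shared))
    return [client + rng.sample(shared, k) for client in clients]


dataset = [(i, i % 2) for i in range(100)]  # toy data: 50 samples per class
G = build_shared_set(dataset, beta=0.10)
clients = [dataset[:30], dataset[30:60], dataset[60:]]
augmented = distribute(G, clients, alpha=1.0)
```

With α = 1.0 every client receives the whole shared set, which matches the setting in which the global shared dataset is fully allocated to each client; unlike FedDIS, this baseline moves raw samples between parties.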
The results obtained by training directly on the IID training set are used as a target reference to better reflect the performance of the methods. Fig. 8 shows the test accuracy of the proposed method and the comparison methods as the number of communication rounds increases. To take a closer look at the variations, the boxplots in Fig. 9 show the test accuracy across 5 runs in the different cases. The experimental results demonstrate the effectiveness of FedDIS. We also observe a rapid decrease in the effectiveness of the Fed_SMOTE method as the data imbalance increases.
To measure the difference in performance between federated and centralized learning more directly, the δ-accuracy loss [28] is modified by subtracting the test accuracy of centralized learning from the test accuracy of the federated learning method, which makes it easier to see which of the two performs better. The δ-accuracy loss curve for each method with an increasing number of communication rounds is shown in Fig. 10. Since the proposed method generates new training data in the process of balancing the dataset, it has more training data than centralized learning; the trained model therefore has more generalization ability, and the δ-accuracy loss will appear to be less than zero. Tables 5 and 6 show the final test performance of the different methods in different non-IID cases. The comparison results show that the final classification accuracy of FedDIS under the best setting is higher than that of the comparison methods, with performance comparable to training directly on the IID training set.
The second comparison experiment is executed on the MNIST dataset with non-IID data among clients; the baseline for comparison is FedAvg [12]. FedProx [10] is a typical algorithm-level approach to non-IID problems that makes corrections to the gradient, and it is a representative algorithm for correcting the local model with the global model. FEDDFUSION [13] and FEDGEN [14] are typical data-level methods, data-based and data-free, respectively, which are the two mainstream kinds of methods for solving non-IID problems at the data level. We run three trials and report the Top-1 test accuracy. Table 7 presents the comparison results; FedDIS performs better than the other methods, and its advantage is larger in the more extreme non-IID cases. As shown in Fig. 11, FedDIS has the fastest convergence speed and achieves better performance than the other methods. In the tables, the bold number indicates the optimal performance.
FEDGEN uses the generator to guide the user model while the generator itself is still being trained. Due to the poor quality of the generator at the beginning of training, it does not regulate the local model well. In contrast, FedDIS first trains a generator independently and reduces non-IIDness across clients before training the task model, so it converges faster when training the task model; and because we improve the hidden space of the VAE to generate higher quality data locally, FedDIS eventually achieves higher accuracy.

Conclusion and future work
In this paper, we introduced FedDIS for image classification, including its detailed design and implementation process. FedDIS shares the distribution information of the hidden space data under the condition of privacy protection and thus indirectly shares the distribution information of the medical images. Using this distribution information, clients with non-IID data can generate data to reduce non-IIDness across clients, thus improving the performance of image classification. We conducted two types of experiments on the Alzheimer's disease MRI dataset and the MNIST dataset. The first studied the algorithm itself, exploring the effect of different hidden space distribution types and σ-point parameters on the performance of the method. The second consisted of comparative experiments under local data imbalance on the client side and under inconsistent data distribution among the clients, which verified the superiority of the method. In addition, the hidden space distribution information sharing strategy can be used as a new method for sharing data knowledge in privacy-preserving situations, which has positive implications for improving the performance of traditional federated learning. For further research, we will explore ways to improve federated learning performance in the case of heterogeneous image data from different institutions by treating the mapping of heterogeneous data to homogeneous data as an image translation problem, which can be handled using, for example, vision transformer and adversarial network models.
Funding This work is supported by Shenzhen Science and Technology Program (JCYJ20220818100004008, JCYJ20200821152629001, 20190808120417257) and partially supported by National Natural Science Foundation of China (No. 62171287).

Data availability
The Alzheimer's disease dataset is publicly accessible at https://www.kaggle.com/legendahmed/alzheimermridataset
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.